Abstract

The hyperclimbing hypothesis is a hypothetical explanation for adaptation in genetic 
algorithms with uniform crossover (UGAs). Hyperclimbing is an intuitive, general-purpose, 
non-local search heuristic applicable to discrete product spaces with rugged or stochastic 
cost functions. The strength of this heuristic lies in its insusceptibility to local optima 
when the cost function is deterministic, and its tolerance for noise when the cost function is 
stochastic. Hyperclimbing works by decimating a search space, i.e. by iteratively fixing the 
values of small numbers of variables. The hyperclimbing hypothesis holds that UGAs work 
by implementing efficient hyperclimbing. Proof of concept for this hypothesis comes from 
the use of a novel analytic technique involving the exploitation of algorithmic symmetry. 
We have also obtained experimental results that show that a simple tweak inspired by 
the hyperclimbing hypothesis dramatically improves the performance of a UGA on large, 
random instances of MAX-3SAT and the Sherrington Kirkpatrick Spin Glasses problem. 

1. Introduction 

Over several decades of use in diverse scientific and engineering fields, evolutionary opti- 
mization has acquired a reputation for being a kind of universal acid — a general purpose 
approach that routinely procures useful solutions to optimization problems with rugged, 
dynamic, and stochastic cost functions over search spaces consisting of strings, vectors, 
trees, and instances of other kinds of data structures (Fogel, 2006). Remarkably, the means 
by which evolutionary algorithms work is still the subject of much debate. An abiding mys- 
tery of the field is the widely observed utility of genetic algorithms with uniform crossover 
(Syswerda, 1989; Rudnick et al., 1994; Pelikan, 2008; Huifang and Mo, 2010). The use of
uniform crossover (Ackley, 1987; Syswerda, 1989) in genetic algorithms causes genetic loci 
to be unlinked, i.e. recombine freely. It is generally acknowledged that the adaptive capac- 
ity of genetic algorithms with this kind of crossover cannot be explained within the rubric of 
the building block hypothesis, the reigning explanation for adaptation in genetic algorithms 
with strong linkage between loci (Goldberg, 2002). Yet, no alternate, scientifically rigorous 
explanation for adaptation in genetic algorithms with uniform crossover (UGAs) has been 
proposed. The hyperclimbing hypothesis, presented in this paper, addresses this gap. This 
hypothesis holds that UGAs perform adaptation by implicitly and efficiently implementing 
a global search heuristic called hyperclimbing. 

If the hyperclimbing hypothesis is sound, then the UGA is in good company. Hyper- 
climbing belongs to a class of heuristics that perform global decimation. Global decimation, 
it turns out, is the state of the art approach to solving large, hard instances of SAT (Kroc
et al., 2009). Conventional global decimation strategies, e.g. Survey Propagation (Mezard
et al., 2002), Belief Propagation, and Warning Propagation (Braunstein et al., 2002), use
message passing algorithms to obtain statistical information about the space being searched.
This information is then used to fix the values of one, or a small number, of search space 
attributes, effectively reducing the size of the search space. The decimation strategy is then 
recursively applied to the smaller search space. And so on. Survey Propagation, perhaps the 
best known global decimation strategy, has been used along with Walksat (Selman et al., 
1993) to solve instances of SAT with upwards of a million variables. The hyperclimbing 
hypothesis holds that in practice, UGAs also perform adaptation by decimating the search 
spaces to which they are applied. Unlike conventional decimation strategies, however, a 
UGA obtains statistical information about the search space implicitly, by means other than 
message passing. 

The rest of this paper is organized as follows: Section 2 provides an informal description
of the hyperclimbing heuristic. A more formal description appears in Section A of the online
appendix. Section 3 presents proof of concept, i.e. it describes a stochastic fitness function^1
on which a UGA behaves as described in the hyperclimbing hypothesis. Exploiting certain 
symmetries inherent within uniform crossover and a containing class of fitness functions, we 
argue that the adaptive capacity of a UGA scales extraordinarily well as the size of the search 
space increases. We follow up with experimental tests that validate this conclusion. One 
way for the hyperclimbing hypothesis to gain credibility is by inspiring modifications to the 
genetic algorithm that improve performance. Section 4 presents the results of experiments 
that show that a simple tweak called clamping, inspired by the hyperclimbing hypothesis, 
dramatically improves the performance of a genetic algorithm on large, randomly generated 
instances of MAX-3SAT and the Sherrington Kirkpatrick Spin Glasses problem. While not
conclusive, this validation does lend considerable support to the hyperclimbing hypothesis^2.
We conclude in Section 5 with a brief discussion of the generalizability of the hyperclimbing
hypothesis and its ramifications for Evolutionary Computation and Evolutionary Biology.

2. The Hyperclimbing Heuristic 

For a sketch of the workings of a hyperclimbing heuristic, consider a search space $S = \{0, 1\}^\ell$,
and a (possibly stochastic) fitness function that maps points in S to real values. Let us
define the order of a schema partition (Mitchell, 1996) to simply be the order of the schemata
that comprise the partition. Clearly then, schema partitions of lower order are coarser 
than schema partitions of higher order. The effect of a schema partition is defined to be 
the variance of the expected fitness of the constituent schemata under sampling from the 
uniform distribution over each schema. So, for example, the effect of the schema partition
#**#** = {0**0**, 0**1**, 1**0**, 1**1**} is

$$\frac{1}{4}\sum_{i=0}^{1}\sum_{j=0}^{1}\Big(F(i{*}{*}j{*}{*}) - F({*}{*}{*}{*}{*}{*})\Big)^{2}$$

where the operator F gives the expected fitness of a schema under sampling from the uniform
distribution. A hyperclimbing heuristic starts by sampling from the uniform distribution
over the entire search space. It subsequently identifies a coarse schema partition with a
non-zero effect, and limits future sampling to a schema in this partition with above average
expected fitness. In other words, the hyperclimbing heuristic fixes the defining bits (Mitchell,
1996) of this schema in the population. This schema constitutes a new (smaller) search
space to which the hyperclimbing heuristic is recursively applied. Crucially, the act of fixing
defining bits in a population has the potential to "generate" a detectable non-zero effect in a
schema partition that previously had a negligible effect. For example, the schema partition
*#***# can have a negligible effect, while the schema partition 1#*0*# has a detectable
non-zero effect. A more formal description of the hyperclimbing heuristic can be found in
Appendix A.

1. A fitness function is nothing but a cost function with a small twist: the goal is not to minimize fitness,
but to maximize it.

2. Then again, no scientific theory can be conclusively validated. The best one can hope for is persuasive
forms of validation (Popper, 2007b, a).
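The effect of a schema partition, as defined in this section, can be estimated by sampling. The following Python sketch is only an illustration and does not come from the paper: the function name `partition_effect`, its parameters, the 0-based locus indices, and the toy fitness function are all assumptions. It estimates the expected fitness of each schema in a partition by uniform sampling and returns the variance of those estimates.

```python
import itertools
import random
import statistics

def partition_effect(fitness, length, loci, samples_per_schema=2000, rng=None):
    """Estimate the effect of the schema partition with defining loci `loci`.

    `loci` holds 0-based positions (an assumption of this sketch). The effect
    is the variance of the expected fitness of the constituent schemata, each
    expectation being estimated by sampling uniformly from the schema.
    """
    rng = rng or random.Random(0)
    schema_means = []
    for bits in itertools.product((0, 1), repeat=len(loci)):
        total = 0.0
        for _ in range(samples_per_schema):
            g = [rng.randint(0, 1) for _ in range(length)]
            for locus, bit in zip(loci, bits):
                g[locus] = bit  # fix the defining bits of this schema
            total += fitness(g)
        schema_means.append(total / samples_per_schema)
    return statistics.pvariance(schema_means)

def toy_fitness(g):
    """Hypothetical fitness: 1 when loci 0 and 3 agree, 0 otherwise."""
    return 1.0 if g[0] == g[3] else 0.0

if __name__ == "__main__":
    print(partition_effect(toy_fitness, 6, [0, 3]))  # order-2 partition #**#**
    print(partition_effect(toy_fitness, 6, [0]))     # order-1 partition #*****
```

On this toy function the order-2 partition has a clearly non-zero effect, while the order-1 partition over locus 0 alone has an effect close to zero, which is the kind of situation in which a heuristic must look at partitions of order greater than one.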

At each step in its progression, hyperclimbing is sensitive, not to the fitness value of any
individual point, but to the sampling means of relatively coarse schemata. This heuristic
is, therefore, natively able to tackle optimization problems with stochastic cost functions.
Considering the intuitive simplicity of hyperclimbing, this heuristic has almost certainly
been toyed with by other researchers in the general field of discrete optimization. In all
likelihood it was set aside each time because of the seemingly high cost of implementation
for all but the smallest of search spaces or the coarsest of schema partitions. Given a search
space comprised by $\ell$ binary variables, there are $\binom{\ell}{o}$ schema partitions of order $o$. For any
fixed value of $o$, $\binom{\ell}{o} \in \Omega(\ell^{o})$ (Cormen et al., 1990). The exciting finding presented in this
paper is that UGAs can implement hyperclimbing cheaply for large values of $\ell$, and values
of $o$ that are small, but greater than one.
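To give a feel for the cost of examining schema partitions exhaustively, the short sketch below counts the schema partitions of a given order. The function name is an assumption made for illustration; `math.comb` is the standard library binomial coefficient.

```python
from math import comb

def num_schema_partitions(length, order):
    """Number of schema partitions of the given order over `length` binary loci."""
    return comb(length, order)

if __name__ == "__main__":
    for length in (200, 2000, 20000):
        print(length, num_schema_partitions(length, 4))
```

For order 4 and a span of 20000 loci the count is on the order of 10^15, and each partition contains 2^4 schemata whose expected fitness would have to be estimated.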

3. Proof of Concept 

We introduce a parameterized stochastic fitness function, called a staircase function, and 
provide experimental evidence that a UGA can perform hyperchmbing on a particular 
parameterization of this function. Then, using symmetry arguments, we conclude that 
the running time and the number of fitness queries required to achieve equivalent results 
scale surprisingly well with changes to key parameters. An experimental test validates this 
conclusion. 

Definition 1 A staircase function descriptor is a 6-tuple $(h, o, \delta, \ell, L, V)$ where $h$, $o$ and $\ell$
are positive integers such that $ho \le \ell$, $\delta$ is a positive real number, and $L$ and $V$ are matrices
with $h$ rows and $o$ columns such that the values of $V$ are binary digits, and the elements of
$L$ are distinct integers in $[\ell]$.

For any positive integer $\ell$, let $[\ell]$ denote the set $\{1, \ldots, \ell\}$, and let $\mathfrak{B}_\ell$ denote the set of
binary strings of length $\ell$. Given any $k$-tuple, $x$, of integers in $[\ell]$, and any binary string
$g \in \mathfrak{B}_\ell$, let $\Xi_x(g)$ denote the string $b_1, \ldots, b_k$ such that for any $i \in [k]$, $b_i = g_{x_i}$. For any
$m \times n$ matrix $M$, and any $i \in [m]$, let $M_{i:}$ denote the $n$-tuple that is the $i$-th row of $M$. Let
$\mathcal{N}(a, b)$ denote the normal distribution with mean $a$ and variance $b$. Then the function, $f$,
described by the staircase function descriptor $(h, o, \delta, \ell, L, V)$ is the stochastic function over
the set of binary strings of length $\ell$ given by Algorithm 1. The parameters $h$, $o$, $\delta$, and $\ell$ are
called the height, order, increment and span, respectively, of $f$. For any $i \in [h]$, we define
step $i$ of $f$ to be the schema $\{g \in \mathfrak{B}_\ell \mid \Xi_{L_{i:}}(g) = V_{i1} \ldots V_{io}\}$, and define stage $i$ of $f$ to be the
schema $\{g \in \mathfrak{B}_\ell \mid (\Xi_{L_{1:}}(g) = V_{11} \ldots V_{1o}) \wedge \ldots \wedge (\Xi_{L_{i:}}(g) = V_{i1} \ldots V_{io})\}$.

Algorithm 1: A staircase function with descriptor $(h, o, \delta, \ell, L, V)$

Input: g is a chromosome of length ℓ

x ← some value drawn from the distribution N(0, 1)
for i ← 1 to h do
    if Ξ_{L_i:}(g) = V_i1 ... V_io then
        x ← x + δ
    else
        x ← x − (δ / (2^o − 1))
        break
    end
end
return x
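Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the name `make_staircase`, the representation of L and V as lists of lists, and the optional `rng` argument are choices made here. Loci in L are 1-based, as in Definition 1.

```python
import random

def make_staircase(h, o, delta, L, V, rng=None):
    """Return a stochastic fitness function that follows Algorithm 1.

    L and V are h-by-o lists of lists. L holds 1-based loci and V holds the
    bit values that define each step.
    """
    rng = rng or random.Random()
    penalty = delta / (2 ** o - 1)

    def f(g):
        x = rng.gauss(0.0, 1.0)  # noise drawn from N(0, 1)
        for i in range(h):
            # Does g match the defining bits of step i + 1?
            if all(g[L[i][j] - 1] == V[i][j] for j in range(o)):
                x += delta
            else:
                x -= penalty
                break
        return x

    return f

if __name__ == "__main__":
    L = [[1, 2], [3, 4]]
    V = [[1, 1], [1, 1]]
    f = make_staircase(h=2, o=2, delta=0.3, L=L, V=V, rng=random.Random(0))
    print(f([1, 1, 1, 1]), f([0, 0, 0, 0]))
```

Because the returned value includes fresh noise on every call, a single query says little about which steps a chromosome belongs to; only averages over many queries reveal the increments.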

A step of the staircase function is said to have been climbed when future sampling of 
the search space is largely limited to that step. Just as it is hard to climb higher steps of 
a physical staircase without climbing lower steps first, it is computationally expensive to 
identify higher steps of a staircase function without identifying lower steps first (Theorem 
1, Appendix C). In this regard, it is possible that staircase functions capture a feature that 
is widespread within the fitness functions resulting from the representational choices of GA 
users. The difficulty of climbing step $i \in [h]$ given stage $i - 1$, however, is non-increasing
with respect to $i$ (Corollary 1, Appendix C). Readers seeking ways to visualize staircase
functions are referred to Appendix B.

3.1 UGA Specification 

The pseudocode for the UGA used in this paper is given in Algorithm 2. The free parameters 
of the UGA are N (the size of the population), $p_m$ (the per bit mutation probability), and
Evaluate-Fitness (the fitness function). Once these parameters are fixed, the UGA is
fully specified. The specification of a fitness function implicitly determines the length of
the chromosomes, $\ell$. Two points deserve further elaboration:

1. The function SUS-Selection takes a population of size N, and a corresponding set 
of fitness values as inputs. It returns a set of parents drawn by fitness proportion- 
ate stochastic universal sampling (SUS). Instead of selecting N parents by spinning 
a roulette wheel with one pointer N times, stochastic universal sampling selects N 
parents by spinning a roulette wheel with equally spaced pointers just once. Select- 
ing parents this way has been shown to reduce sampling error (Baker, 1985; Mitchell, 
1996). 

2. When selection is fitness proportionate, an increase in the average fitness of the population
causes a decrease in selection pressure. The UGA in Algorithm 2 combats
this effect by using sigma scaling (Mitchell, 1996, p 167) to adjust the fitness values
returned by Evaluate-Fitness. These adjusted fitness values, not the raw ones, are
used when selecting parents. Let $f_x^{(t)}$ denote the raw fitness of some chromosome $x$
in some generation $t$, and let $\bar{f}^{(t)}$ and $\sigma^{(t)}$ denote the mean and standard deviation of
the raw fitness values in generation $t$ respectively. Then the adjusted fitness of $x$ in
generation $t$ is given by $h_x^{(t)}$ where, if $\sigma^{(t)} = 0$ then $h_x^{(t)} = 1$, otherwise,

$$h_x^{(t)} = \max\left(0,\; 1 + \frac{f_x^{(t)} - \bar{f}^{(t)}}{\sigma^{(t)}}\right).$$

The use of sigma scaling also entails that negative fitness values are handled appropriately.
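The two points above can be sketched in Python. This is an illustrative sketch, not the code used in the paper: the names `sigma_scale` and `sus_select` are assumptions, the population standard deviation is assumed, and adjusted fitness values are assumed to be clipped below at zero so that they can serve as selection weights.

```python
import random
import statistics

def sigma_scale(raw):
    """Return sigma-scaled fitness values for one generation."""
    mean = statistics.mean(raw)
    sd = statistics.pstdev(raw)  # population standard deviation (assumption)
    if sd == 0:
        return [1.0] * len(raw)
    return [max(0.0, 1.0 + (f - mean) / sd) for f in raw]

def sus_select(adjusted, rng=None):
    """Pick len(adjusted) parent indices by stochastic universal sampling.

    One random offset positions N equally spaced pointers on a wheel whose
    slots are proportional to the adjusted fitness values.
    """
    rng = rng or random.Random()
    n = len(adjusted)
    spacing = sum(adjusted) / n
    pointer = rng.random() * spacing  # the single spin of the wheel
    chosen = []
    i, cumulative = 0, adjusted[0]
    for _ in range(n):
        # Advance to the first individual whose slot extends past the pointer.
        while cumulative <= pointer and i < n - 1:
            i += 1
            cumulative += adjusted[i]
        chosen.append(i)
        pointer += spacing
    return chosen

if __name__ == "__main__":
    scaled = sigma_scale([3.0, -1.0, 0.5, 2.0])
    print(scaled, sus_select(scaled, rng=random.Random(0)))
```

With a single spin, an individual whose adjusted fitness is a fraction p of the total is selected either floor(Np) or ceil(Np) times, which is the sense in which sampling error is reduced relative to N independent spins.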



3.2 Performance of a UGA on a class of Staircase Functions 

Let $f$ be a staircase function with descriptor $(h, o, \delta, \ell, L, V)$. We say that $f$ is basic if $\ell = ho$,
$L_{ij} = o(i - 1) + j$ (i.e. if $L$ is the matrix of integers from 1 to $ho$ laid out row-wise), and $V$ is
a matrix of ones. If $f$ is known to be basic, then the last three elements of the descriptor of
$f$ are fully determinable from the first three, and its descriptor can be shortened to $(h, o, \delta)$.
Given some staircase function $f$ with descriptor $(h, o, \delta, \ell, L, V)$, we define the basic form of
$f$ to be the (basic) staircase function with descriptor $(h, o, \delta)$.
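The last three descriptor elements of a basic staircase function follow mechanically from its height and order. The helper below is a hypothetical illustration of that construction (the name `basic_descriptor` and the list-of-lists layout are assumptions).

```python
def basic_descriptor(h, o):
    """Return (span, L, V) for the basic staircase function of height h and order o."""
    span = h * o
    # L[i-1][j-1] = o * (i - 1) + j: the integers 1..h*o laid out row-wise.
    L = [[o * (i - 1) + j for j in range(1, o + 1)] for i in range(1, h + 1)]
    V = [[1] * o for _ in range(h)]  # a matrix of ones
    return span, L, V

if __name__ == "__main__":
    span, L, V = basic_descriptor(3, 2)
    print(span, L, V)
```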

Let $\phi^*$ be the basic staircase function with descriptor $(h = 50, o = 4, \delta = 0.3)$, and
let $U$ denote the UGA defined in section 3.1 with a population size of 500, and a per bit
mutation probability of 0.003 (i.e., $p_m = 0.003$). Figure 1a shows that $U$ is capable of
robust adaptation when applied to $\phi^*$ (we denote the resulting algorithm by $U^{\phi^*}$). Figure
1c shows that under the action of $U$, the first four steps of $\phi^*$ go to fixation^3 in ascending
order. When a step gets fixed, future sampling will largely be confined to that step; in
effect, the hyperplane associated with the step has been climbed. Note that the UGA does
not need to "fully" climb a step before it begins climbing the subsequent step (Figure 1c).

3.3 Symmetry Analysis and Experimental Confirmation 

Formal models of SGAs with finite populations and non-trivial fitness functions (Nix and
Vose, 1992) are notoriously unwieldy (Holland, 2000), which is why most theoretical analyses
of SGAs assume an infinite population (Liepins and Vose, 1992; Stephens and Waelbroeck,
1999; Wright et al., 2003; Burjorjee, 2007). Unfortunately, since the running time
and the number of fitness evaluations required by such models are always infinite, their use
precludes the identification of computational efficiencies of the SGA. In the present case, we
circumvent the difficulty of formally analyzing finite population SGAs by exploiting some
simple symmetries introduced through our definition of staircase functions, and through our
use of a crossover operator with no positional bias. The absence of positional bias in uniform
crossover was highlighted by Eshelman et al. (1989). Essentially, permuting the bits

3. The terms 'fixation' and 'fixing' are used loosely here. Clearly, as long as the mutation rate is non-zero, 
no locus can ever be said to go to fixation in the strict sense of the word. 
Algorithm 2: Pseudocode for the UGA used. The population size is an even number,
denoted N, the length of the chromosomes is ℓ, and for any chromosomal bit, the
probability that the bit will be flipped during mutation (the per bit mutation probability)
is p_m. The population is represented internally as an N by ℓ array of bits, with
each row representing a single chromosome. Generate-UX-Masks(x, y) creates an
x by y array of bits drawn from the uniform distribution over {0, 1}. Generate-Mut-Masks(x, y, z)
returns an x by y array of bits such that any given bit is 1 with
probability z.

pop ← Initialize-Population(N, ℓ)
while some termination condition is unreached do
    fitnessValues ← Evaluate-Fitness(pop)
    adjustedFitVals ← Sigma-Scale(fitnessValues)
    parents ← SUS-Selection(pop, adjustedFitVals)
    crossMasks ← Generate-UX-Masks(N/2, ℓ)
    for i ← 1 to N/2 do
        for j ← 1 to ℓ do
            if crossMasks[i, j] = 0 then
                newPop[i, j] ← parents[i, j]
                newPop[i + N/2, j] ← parents[i + N/2, j]
            else
                newPop[i, j] ← parents[i + N/2, j]
                newPop[i + N/2, j] ← parents[i, j]
            end
        end
    end
    mutMasks ← Generate-Mut-Masks(N, ℓ, p_m)
    for i ← 1 to N do
        for j ← 1 to ℓ do
            newPop[i, j] ← xor(newPop[i, j], mutMasks[i, j])
        end
    end
    pop ← newPop
end
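The recombination and mutation steps of Algorithm 2 can be sketched in Python as below. This is an illustrative sketch under assumptions: the function name and the list-of-lists population are choices made here, and the crossover and mutation masks are drawn bit by bit instead of being generated up front as arrays.

```python
import random

def uniform_crossover_and_mutate(parents, p_m, rng=None):
    """Build the next population from `parents`, a list of N bit lists (N even).

    Parent i is paired with parent i + N/2. At every locus an independent fair
    coin decides whether the pair keeps or swaps its bits, after which each bit
    is flipped with probability p_m.
    """
    rng = rng or random.Random()
    n, length = len(parents), len(parents[0])
    half = n // 2
    new_pop = [[0] * length for _ in range(n)]
    for i in range(half):
        for j in range(length):
            if rng.random() < 0.5:  # crossover mask bit is 0: keep
                a, b = parents[i][j], parents[i + half][j]
            else:                   # crossover mask bit is 1: swap
                a, b = parents[i + half][j], parents[i][j]
            new_pop[i][j], new_pop[i + half][j] = a, b
    for row in new_pop:
        for j in range(length):
            if rng.random() < p_m:  # mutation mask bit is 1: flip
                row[j] ^= 1
    return new_pop

if __name__ == "__main__":
    parents = [[0] * 8, [1] * 8, [0, 1] * 4, [1, 0] * 4]
    print(uniform_crossover_and_mutate(parents, 0.003, rng=random.Random(0)))
```

Since the coin used at each locus is independent of the locus index, relabeling the loci does not change the distribution of offspring, which is the absence of positional bias discussed in Section 3.3.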



of all strings in each generation using some permutation $\pi$ before crossover, and permuting
the bits back using $\pi^{-1}$ after crossover has no effect on the dynamics of a UGA. Another
way to elucidate this symmetry is by noting that any homologous crossover operator can 
be modeled as a string of binary random variables. Only in the case of uniform crossover, 
however, are these random variables all independent and identically distributed. 

It is easily seen that loci that are not part of any step of a staircase function are 
immaterial during fitness evaluation. The absence of positional bias in uniform crossover 
entails that such loci can also be ignored during recombination. Effectively, then, these loci 
can be "spliced out" without affecting the expected average fitness of the population in any 
generation. This, and other observations of this type lead to the conclusion below. 








Figure 1: (a) The mean, across 20 trials, of the average fitness of the population of $U^{\phi^*}$
in each of 5000 generations. The error bars show five standard errors above and
below the mean every 200 generations. (c) Going from the top plot to the bottom
plot, the mean frequencies, across 20 trials, of the first four steps of the staircase
function in each of the first 250 generations. The error bars show three
standard errors above and below the mean every 12 generations. (b,d) Same as
the plots on the left, but for $U^{\phi}$, where $\phi$ is the staircase function with span
20000 described in Section 3.3.



Let $W$ be some UGA. For any staircase function $f$, and any $x \in [0, 1]$, let $p^{(t)}_{(W^f, i)}(x)$
denote the probability that the frequency of stage $i$ of $f$ in generation $t$ of $W^f$ is $x$. Let $f^*$
be the basic form of $f$. Then, by appreciating the symmetries between the UGAs $W^f$ and
$W^{f^*}$ one can conclude the following:

Conclusion 1 For any generation $t$, any $i \in [h]$, and any $x \in [0, 1]$, $p^{(t)}_{(W^f, i)}(x) = p^{(t)}_{(W^{f^*}, i)}(x)$.

This conclusion straightforwardly entails that to raise the average fitness of a population 

to some attainable value, 

1. The expected number of generations required is constant with respect to the span of 
a staircase function 

2. The running time required scales linearly with the span of a staircase function 

3. The running time and the number of generations are unaffected by the last two ele- 
ments of the descriptor of a staircase function 

Let $f$ be some staircase function with basic form $\phi^*$ (defined in Section 3.2). Then,
given the above, the application of $U$ to $f$ should, discounting deviations due to sampling,
produce results identical to those shown in Figures 1a and 1c. We validated this "corollary"
by applying $U$ to the staircase function $\phi$ with descriptor $(h = 50, o = 4, \delta = 0.3, \ell =
20000, L, V)$ where $L$ and $V$ were randomly generated. The results are shown in Figures 1b
and 1d. Note that gross changes to the matrices $L$ and $V$, and an increase in the span of the
staircase function by two orders of magnitude did not produce any statistically significant
changes. It is hard to think of another algorithm with better scaling properties on this
non-trivial class of fitness functions.

4. Validation 

Let us pause to consider a curious aspect of the behavior of $U^{\phi^*}$. Figure 1 shows that the
growth rate of the average fitness of the population of $U^{\phi^*}$ decreases as evolution proceeds,
and the average fitness of the population plateaus at a level that falls significantly short
of the maximum expected average population fitness of 15. As discussed in the previous
section, the difficulty of climbing step $i$ given stage $i - 1$ is non-increasing with respect to
$i$. So, given that $U$ successfully identifies the first step of $\phi^*$, why does it fail to identify all
remaining steps? To understand why, consider some binary string that belongs to the $i$-th
stage of $\phi^*$. Since the mutation rate of $U$ is 0.003, the probability that this binary string
will still belong to stage $i$ after mutation is $0.997^{io}$. This entails that as $i$ increases, $U$ is
less able to "hold" a population within stage $i$. In light of this observation, one can infer
that as $i$ increases the sensitivity of $U$ to the conditional fitness signal of step $i$ given stage
$i - 1$ will decrease. This loss in sensitivity explains the decrease in the growth rate of the
average fitness of $U^{\phi^*}$. We call the "wastage" of fitness queries described here mutational
drag.
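The size of this effect is easy to tabulate. The snippet below is an illustrative calculation (the function name is an assumption) of the probability that a chromosome in stage i keeps all of its i·o defining bits through one round of mutation, using the order and mutation rate from Section 3.2.

```python
def hold_probability(stage, order=4, p_m=0.003):
    """Probability that none of the stage's defining bits is flipped by mutation."""
    return (1.0 - p_m) ** (stage * order)

if __name__ == "__main__":
    for stage in (1, 10, 25, 50):
        print(stage, round(hold_probability(stage), 3))
```

With these settings the probability falls from roughly 0.99 at stage 1 to roughly 0.55 at stage 50, so close to half of the offspring of a population sitting in the top stage would be knocked out of it in every generation.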

To curb mutational drag in UGAs, we conceived of a very simple tweak called clamping.
This tweak relies on parameters flagFreqThreshold ∈ [0.5, 1], unflagFreqThreshold ∈
[0.5, flagFreqThreshold], and the positive integer waitingPeriod. If the one-frequency
or the zero-frequency of some locus (i.e. the frequency of the bit 1 or the frequency
of the bit 0, respectively, at that locus) at the beginning of some generation is greater
than flagFreqThreshold, then the locus is flagged. Once flagged, a locus remains
flagged as long as the one-frequency or the zero-frequency of the locus is greater than
unflagFreqThreshold at the beginning of each subsequent generation. If a flagged locus
in some generation t has remained constantly flagged for the last waitingPeriod generations,
then the locus is considered to have passed our fixation test, and is not mutated in
generation t. This tweak is called clamping because it is expected that in the absence of
mutation, a locus that has passed our fixation test will quickly go to strict fixation, i.e. the
one-frequency, or the zero-frequency of the locus will get "clamped" at one for the remainder
of the run.

Figure 2: (Top) The mean (across 20 trials) of the average fitness of the UGA $U_c$ on the
staircase function $\phi^*$. Error bars show five standard errors above and below the
mean every 200 generations. (Bottom) The mean (across 20 trials) of the number
of loci left unmutated by the clamping mechanism. Error bars show three standard
errors above and below the mean every 200 generations.
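The bookkeeping behind clamping can be sketched as a small Python class. This is one possible reading of the description above, not the authors' code: the class name `Clamper`, the snake_case parameter names, and the convention that the generation in which a locus is first flagged counts as the first of the waitingPeriod generations are all assumptions.

```python
class Clamper:
    """Tracks flagged loci and reports which loci are exempt from mutation."""

    def __init__(self, length, flag_thresh=0.99, unflag_thresh=0.9, waiting_period=200):
        self.flag_thresh = flag_thresh
        self.unflag_thresh = unflag_thresh
        self.waiting_period = waiting_period
        # Number of consecutive generations each locus has been flagged.
        self.flagged_for = [0] * length

    def update(self, pop):
        """Call once at the beginning of a generation with the current population.

        Returns the set of (0-based) loci that pass the fixation test and
        should therefore not be mutated in this generation.
        """
        n = len(pop)
        exempt = set()
        for j in range(len(self.flagged_for)):
            one_freq = sum(row[j] for row in pop) / n
            dominant = max(one_freq, 1.0 - one_freq)
            if self.flagged_for[j] > 0:
                # Already flagged: stay flagged only above the unflag threshold.
                if dominant > self.unflag_thresh:
                    self.flagged_for[j] += 1
                else:
                    self.flagged_for[j] = 0
            elif dominant > self.flag_thresh:
                self.flagged_for[j] = 1
            if self.flagged_for[j] >= self.waiting_period:
                exempt.add(j)
        return exempt
```

In a UGA loop, the returned set would be used to zero out the corresponding columns of the mutation masks before they are applied.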

Let $U_c$ denote a UGA that uses the clamping mechanism described above and is identical
to the UGA $U$ in every other way. The clamping mechanism used by $U_c$ is parameterized as
follows: flagFreqThreshold = 0.99, unflagFreqThreshold = 0.9, waitingPeriod = 200.
The performance of $U_c^{\phi^*}$ is displayed in Figure 2a. Figure 2b shows the number of loci that
the clamping mechanism left unmutated in each generation. These two figures show that
the clamping mechanism effectively allowed $U_c$ to climb all the stages of $\phi^*$.

If the hyperclimbing hypothesis is accurate, then mutational drag is likely to be an 
issue when UGAs are applied to other problems, especially large instances that require the 
use of long chromosomes. In such cases, the use of clamping should improve performance. 
We now present the results of experiments where the use of clamping clearly improves the 
performance of a UGA on large instances of MAX-3SAT and the Sherrington Kirkpatrick
Spin Glasses problem. 
4.1 Validation on MAX-3SAT

MAX-kSAT (Hoos and Stützle, 2004) is one of the most extensively studied combinatorial
optimization problems. An instance of this problem consists of n boolean variables, and m
clauses. The literals of the instance are the n variables and their negations. Each clause is a
disjunction of k of the total possible 2n literals. Given some MAX-kSAT instance, the value
of a particular setting of the n variables is simply the number of the m clauses that evaluate
to true. In a uniform, random MAX-kSAT problem, the clauses are generated by picking
each literal at random (with replacement) from amongst the 2n literals. Generated clauses
containing multiple copies of a variable, and ones containing a variable and its negation,
are discarded and replaced.
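The generation procedure and the fitness measure just described can be sketched in Python. The function names and the signed-integer encoding of literals (+v for variable v, -v for its negation) are assumptions made for this illustration.

```python
import random

def random_ksat(n, m, k=3, rng=None):
    """Generate m clauses of a uniform random MAX-kSAT instance over n variables."""
    rng = rng or random.Random()
    clauses = []
    while len(clauses) < m:
        # Pick each literal uniformly, with replacement, from the 2n literals.
        clause = [rng.choice((-1, 1)) * rng.randint(1, n) for _ in range(k)]
        # Discard clauses in which a variable occurs more than once, whether
        # as a repeated literal or as a variable together with its negation.
        if len({abs(lit) for lit in clause}) == k:
            clauses.append(clause)
    return clauses

def num_satisfied(clauses, assignment):
    """Count satisfied clauses; assignment[v - 1] is the bit given to variable v."""
    return sum(
        any((assignment[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
        for clause in clauses
    )

if __name__ == "__main__":
    rng = random.Random(0)
    instance = random_ksat(1000, 4000, rng=rng)
    chromosome = [rng.randint(0, 1) for _ in range(1000)]
    print(num_satisfied(instance, chromosome))
```

Under this encoding a chromosome is simply the list of variable assignments, and its fitness is the value returned by `num_satisfied`.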

Let $Q$ denote the UGA defined in section 3.1 with a population size of 200 (N = 200)
and a per bit mutation probability of 0.01 (i.e., $p_m = 0.01$). We applied $Q$ to a randomly
generated instance of the Uniform Random 3SAT problem, denoted sat, with 1000 binary
variables and 4000 clauses. Variable assignments were straightforwardly encoded, with each
bit in a chromosome representing the value of a single variable. The fitness of a chromosome
was simply the number of clauses satisfied under the variable assignment represented. Figure
3a shows the average fitness of the population of $Q^{sat}$ over 7000 generations. Note that the
growth in the maximum and average fitness of the population tapered off by generation
1000.



The UGA $Q$ was applied to sat once again; this time, however, the clamping mechanism
described above was activated in generation 2000. The resulting UGA is denoted
$Q_c^{sat}$. The clamping parameters used were as follows: flagFreqThreshold = 0.99,
unflagFreqThreshold = 0.8, waitingPeriod = 200. The average fitness of the population
of $Q_c^{sat}$ over 7000 generations is shown in Figure 3b, and the number of loci that the
clamping mechanism left unmutated in each generation is shown in Figure 3c. Once again,
the growth in the maximum and average fitness of the population tapered off by generation
1000. However, the maximum and average fitness began to grow once again starting at
generation 2200. This growth coincides with the commencement of the clamping of loci
(compare Figures 3b and 3c).

4.2 Validation on an SK Spin Glasses System 

A Sherrington Kirkpatrick Spin Glasses system is a set of coupling constants $J_{ij}$, with
$1 \le i < j \le \ell$. Given a configuration of "spins" $(\sigma_1, \ldots, \sigma_\ell)$, where each spin is a value in
$\{+1, -1\}$, the "energy" of the system is given by

$$E(\sigma) = -\sum_{1 \le i < j \le \ell} J_{ij}\,\sigma_i\,\sigma_j .$$

The goal is to find a spin configuration that minimizes energy. By defining the fitness
of some spin configuration $\sigma$ to be $-E(\sigma)$ we remain true to the conventional goal in
genetic algorithmics of maximizing fitness. The coupling constants in $J$ can either be drawn
from the set $\{-1, 0, +1\}$, or from the Gaussian distribution $\mathcal{N}(0, 1)$. Following Pelikan et
al. (2008), we used coupling constants drawn from $\mathcal{N}(0, 1)$. Each chromosome in the
evolving population straightforwardly represented a spin configuration, with the bits 1 and
0 denoting the spins +1 and −1 respectively^4. The UGAs $Q$ and $Q_c$ (described in the
previous subsection) were applied to a randomly generated Sherrington Kirkpatrick spin
glasses system over 1000 spins, denoted spin. The results obtained (Figures 3d, 3e, and 3f)
were similar to the results described in the previous subsection.
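The fitness evaluation for an SK system can be sketched in Python as follows. This is an illustration under assumptions: the function names, the dictionary representation of the couplings, the 0-based spin indices, and the mapping of bit 1 to spin +1 and bit 0 to spin −1 are choices made here; the energy is taken to be the negated sum of $J_{ij}\sigma_i\sigma_j$ over pairs $i < j$.

```python
import random

def random_couplings(length, rng=None):
    """Draw a coupling constant from N(0, 1) for every pair i < j (0-based)."""
    rng = rng or random.Random()
    return {(i, j): rng.gauss(0.0, 1.0)
            for i in range(length) for j in range(i + 1, length)}

def energy(J, spins):
    """Energy of a spin configuration: minus the sum of J_ij * s_i * s_j."""
    return -sum(c * spins[i] * spins[j] for (i, j), c in J.items())

def fitness(J, bits):
    """Fitness of a chromosome: the negated energy of the spins it encodes."""
    spins = [1 if b == 1 else -1 for b in bits]
    return -energy(J, spins)

if __name__ == "__main__":
    rng = random.Random(0)
    J = random_couplings(20, rng)
    chromosome = [rng.randint(0, 1) for _ in range(20)]
    print(fitness(J, chromosome))
```

Flipping every spin leaves the energy unchanged, so every optimum of such a system has a mirror image; a chromosome and its bitwise complement always receive the same fitness.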

It should be said that clamping by itself does not cause decimation. It merely enforces 
strict decimation once a high degree of decimation has already occurred along some di- 
mension. In other words, clamping can be viewed as a decimation "lock-in" mechanism as 
opposed to a decimation enforcing mechanism. Thus, the occurrence of clamping shown in 
Figure 3 entails the occurrence of decimation. The effectiveness of clamping demonstrated 
above lends considerable support to the hyperclimbing hypothesis. More support of this 
kind can be found in the work of Huifang and Mo (2010) where the use of clamping improved 
the performance of a UGA on a completely different problem (optimizing the weights of a 
quantum neural network). A fair portion of the scientific usefulness of these experiments is
attributable to the utter simplicity of clamping. Reasoning within the rubric of the hyperclimbing
hypothesis, it is not difficult to think of adjustments to the UGA that are more effective,
but also more complex. From an engineering standpoint the additional complexity would 
indeed be warranted. From a scientific perspective, however, the additional complexity is a 
liability because it might introduce suspicion that the adjustments work for reasons other 
than the one offered here. 

5. Conclusion 

Simple genetic algorithms with uniform crossover (UGAs) perform adaptation by implicitly 
exploiting one or more features common to the fitness distributions arising in practice. Two 
key questions are i) What type of features? and ii) How are these features exploited by 
the UGA (i.e. what heuristic does the UGA implicitly implement)? The hyperclimbing 
hypothesis is the first scientific theory to venture answers to these questions. In doing so it 
challenges two commonly held views about the conditions necessary for a genetic algorithm 
to be effective: First, that the fitness distribution must have a building block structure 
(Goldberg, 2002; Watson, 2006). Second, that a genetic algorithm will be ineffective un- 
less it makes use of a "linkage learning" mechanism (Goldberg, 2002). Support for the 
hyperclimbing hypothesis was presented in the proof of concept and validation sections of 
this article. Additional support for this hypothesis can be found in i) the weakness of 
the assumptions undergirding this hypothesis (compared to the building block hypothesis, 
the hyperclimbing hypothesis rests on weaker assumptions about the distribution of fitness 
over the search space; see Burjorjee 2009), ii) the computational efficiencies of the UGA 
rigorously identified in an earlier work (Burjorjee, 2009, Chapter 3), and iii) the utility of 
clamping reported by Huifang and Mo (2010). 

If the hyperclimbing heuristic is sound, then the idea of a landscape (Wright, 1932; 
Kauffman, 1993) is not very useful for intuiting the behavior of UGAs. Far more useful 
is the notion of a hyperscape. Landscapes and hyperscapes are both just ways of concep- 
tualizing fitness functions geometrically. Whereas landscapes draw one's attention to the 

4. Given an $n \times \ell$ matrix $P$ representing a population of $n$ spin configurations, each of size $\ell$, the energies
of the spin configurations can be expressed compactly as $-PJ^{T}P^{T}$ where $J$ is an $\ell \times \ell$ upper triangular
matrix containing the coupling constants of the SK system.
Figure 3: (a,b) The performance, over 10 trials, of the UGAs $Q$ and $Q_c$ on a
randomly generated instance of the Uniform Random 3SAT problem with 1000
variables and 4000 clauses. The mean (across trials) of the average fitness of the
population is shown in black. The mean of the best-of-population fitness is shown
in blue. Error bars show five standard errors above and below the mean every
400 generations. (c) The mean number of loci left unmutated by the clamping
mechanism used by $Q_c$. Error bars show three standard errors above and below the
mean every 400 generations. The vertical dotted line marks generation 2200 in all
three plots. (d,e,f) Same as above, but for a randomly generated Sherrington
Kirkpatrick Spin Glasses System over 1000 spins (see main text for details).



interplay between the fitness function and the neighborhood structure of individual points, 
hyperscapes are about the statistical fitness properties of individual hyperplanes, and the 
"spatial" relationships between hyperplanes — lower order hyperplanes can contain higher 
order hyperplanes, hyperplanes can intersect each other, and disjoint hyperplanes belonging 
to the same hyperplane partition can be regarded as parallel. The use of hyperscapes for
intuiting GA dynamics originated with Holland (1975) and was popularized by Goldberg 
(1989). 

Useful as it may be as an explanation for adaptation in UGAs, the ultimate value of 
the hyperclimbing hypothesis may lie in its generalizability. In a previous work (Burjorjee, 
2009), the notion of a unit of inheritance — i.e. a gene — was used to generalize this hypoth- 
esis to account for adaptation in simple genetic algorithms with strong linkage between 
chromosomal loci. It may be possible for the hyperclimbing hypothesis to be generalized 
further to account for adaptation in other kinds of evolutionary algorithms. In general, such
algorithms may perform adaptation by efficiently identifying and progressively fixing above 
average "aspects" — units of selection in evolutionary biology speak — of the chromosomes 
under evolution. The precise nature of the unit of selection in each case would need to be 
determined. 

If the hyperclimbing hypothesis and its generalizations are sound, we would finally have a
unified explanation for adaptation in evolutionary algorithms. Fundamental advances in the 
invention, application, and further analysis of these algorithms can be expected to follow. 
The field of global optimization would be an immediate beneficiary. In turn, a range of fields, 
including machine learning, drug discovery, and operations research stand to benefit. Take 
machine learning for instance. Machine learning problems that can be tackled today are, 
in large part, ones that are reducible in practice to convex optimization problems (Bennett 
and Parrado-Hernández, 2006). The identification of an intuitive, efficiently implementable,
general purpose meta-heuristic for optimization over rugged, dynamic, and stochastic cost 
functions promises to significantly extend the reach of this field. 

Finally, we briefly touch on the interdisciplinary contribution that the hyperclimbing 
hypothesis makes to a longstanding debate about the units of selection in biological popu- 
lations (Okasha, 2006; Dawkins, 1999a, b). The material presented in the proof of concept 
section of this paper, and especially the material in Chapter 3 of an earlier work (Burjorjee, 
2009) suggest that the most basic unit of selection is, not the individual gene as is com- 
monly thought, but a small set of genes. Chapter 3 of the earlier work (Burjorjee, 2009) 
demonstrates conclusively that as a unit of selection, the latter is not always reducible to 
instances of the former. In other words, it gives the lie to the common refrain in Population 
Genetics that multi-gene interactions can be ignored when studying adaptation in biological 
populations because "additive effects are the basis for selection" (Wagner, 2002). 
Appendix A. The Hyperclimbing Heuristic: Formal Description



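Introducing new terminology and notation where necessary, we present a formal description
of the hyperclimbing heuristic. For any positive integer $\ell$, let $[\ell]$ denote the set
$\{1, \ldots, \ell\}$, and let $\mathfrak{B}_\ell$ denote the set of all binary strings of length $\ell$. For
any binary string $g$, let $g_i$ denote the $i^{th}$ bit of $g$. We define the schema partition model
set of $\ell$, denoted $\mathrm{SPM}_\ell$, to be the power set of $[\ell]$, and define the schema model set of
$\ell$, denoted $\mathrm{SM}_\ell$, to be the set $\{h : D \to \{0,1\} \mid D \in \mathrm{SPM}_\ell\}$. Let $\mathfrak{S}_\ell$ and
$\mathrm{SP}_\ell$ be the set of all schemata and schema partitions (Mitchell, 1996), respectively, of
the set $\mathfrak{B}_\ell$. Given some schema $\gamma \subset \mathfrak{B}_\ell$, let $\pi(\gamma)$ denote the set
$\{i \in [\ell] \mid \forall x, y \in \gamma,\ x_i = y_i\}$. We define a schema modeling function
$\mathrm{SMF}_\ell : \mathfrak{S}_\ell \to \mathrm{SM}_\ell$ as follows: for any $\gamma \in \mathfrak{S}_\ell$, $\mathrm{SMF}_\ell$ maps $\gamma$ to the
function $h : \pi(\gamma) \to \{0,1\}$ such that for any $g \in \gamma$ and any $i \in \pi(\gamma)$, $h(i) = g_i$.
We define a schema partition modeling function $\mathrm{SPMF}_\ell : \mathrm{SP}_\ell \to \mathrm{SPM}_\ell$ as follows:
for any $\Gamma \in \mathrm{SP}_\ell$, $\mathrm{SPMF}_\ell(\Gamma) = \pi(\gamma)$, where $\gamma \in \Gamma$. As $\pi(\psi) = \pi(\xi)$ for all
$\psi, \xi \in \Gamma$, the schema partition modeling function is well defined. It is easily seen
that $\mathrm{SMF}_\ell$ and $\mathrm{SPMF}_\ell$ are both bijective. For any schema model $h \in \mathrm{SM}_\ell$, we denote
$\mathrm{SMF}_\ell^{-1}(h)$ by $\llbracket h \rrbracket_\ell$. Likewise, for any schema partition model $S \in \mathrm{SPM}_\ell$ we
denote $\mathrm{SPMF}_\ell^{-1}(S)$ by $\llbracket S \rrbracket_\ell$. Going in the forward direction, for any schema
$\gamma \in \mathfrak{S}_\ell$, we denote $\mathrm{SMF}_\ell(\gamma)$ by $\langle \gamma \rangle$. Likewise, for any schema partition
$\Gamma \in \mathrm{SP}_\ell$, we denote $\mathrm{SPMF}_\ell(\Gamma)$ by $\langle \Gamma \rangle$. We drop the $\ell$ when going in this
direction, because its value in each case is ascertainable from the operand. For any schema
partition $\Gamma$, and any schema $\gamma \in \Gamma$, the order of $\Gamma$ and the order of $\gamma$ are both
$|\langle \Gamma \rangle|$.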

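For any two schema partitions $\Gamma_1, \Gamma_2 \in \mathrm{SP}_\ell$, we say that $\Gamma_1$ and $\Gamma_2$ are orthogonal if the
models of $\Gamma_1$ and $\Gamma_2$ are disjoint (i.e., $\langle \Gamma_1 \rangle \cap \langle \Gamma_2 \rangle = \emptyset$). Let $\Gamma_1$ and $\Gamma_2$ be
orthogonal schema partitions in $\mathrm{SP}_\ell$, and let $\gamma_1 \in \Gamma_1$ and $\gamma_2 \in \Gamma_2$ be two schemata.
Then the concatenation $\Gamma_1\Gamma_2$ denotes the schema partition $\llbracket \langle \Gamma_1 \rangle \cup \langle \Gamma_2 \rangle \rrbracket_\ell$, and the
concatenation $\gamma_1\gamma_2$ denotes the schema $\llbracket h \rrbracket_\ell$, where $h : \langle \Gamma_1 \rangle \cup \langle \Gamma_2 \rangle \to \{0,1\}$ is
such that for any $i \in \langle \Gamma_1 \rangle$, $h(i) = \langle \gamma_1 \rangle(i)$, and for any $i \in \langle \Gamma_2 \rangle$,
$h(i) = \langle \gamma_2 \rangle(i)$. Since $\langle \Gamma_1 \rangle$ and $\langle \Gamma_2 \rangle$ are disjoint, $\gamma_1\gamma_2$ is well defined. Let $\Gamma_1$
and $\Gamma_2$ be orthogonal schema partitions, and let $\gamma_1 \in \Gamma_1$ be some schema. Then $\gamma_1.\Gamma_2$
denotes the set $\{\gamma_1\xi \in \Gamma_1\Gamma_2 \mid \xi \in \Gamma_2\}$.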

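Given some (possibly stochastic) fitness function $f$ over the set $\mathfrak{B}_\ell$, and some schema
$\gamma \in \mathfrak{S}_\ell$, we define the fitness of $\gamma$, denoted $F^{(f)}_\gamma$, to be a random variable that gives the
fitness value of a binary string drawn from the uniform distribution over $\gamma$. For any schema
partition $\Gamma \in \mathrm{SP}_\ell$, we define the effect of $\Gamma$, denoted $\mathrm{Effect}[\Gamma]$, to be the variance$^5$ of the
expected fitness values of the schemata of $\Gamma$. In other words,

\[
\mathrm{Effect}[\Gamma] = 2^{-|\langle \Gamma \rangle|} \sum_{\gamma \in \Gamma} \Bigl( E[F^{(f)}_\gamma] - 2^{-|\langle \Gamma \rangle|} \sum_{\xi \in \Gamma} E[F^{(f)}_\xi] \Bigr)^2
\]
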
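5. We use variance because it is a well known measure of dispersion. Other measures of dispersion may
well be substituted here without affecting the discussion.

Let $\Gamma_1, \Gamma_2 \in \mathrm{SP}_\ell$ be schema partitions such that $\langle \Gamma_1 \rangle \subset \langle \Gamma_2 \rangle$. It is easily seen that
$\mathrm{Effect}[\Gamma_1] \le \mathrm{Effect}[\Gamma_2]$, with equality if and only if $F^{(f)}_{\gamma_2} = F^{(f)}_{\gamma_1}$ for all schemata
$\gamma_1 \in \Gamma_1$ and $\gamma_2 \in \Gamma_2$ such that $\gamma_2 \subset \gamma_1$. This condition is unlikely to arise in practice;
therefore, for all practical purposes, the effect of a given schema partition decreases as the
partition becomes coarser. The schema partition $\llbracket [\ell] \rrbracket_\ell$ has the maximum effect. Let $\Gamma$ and
$\Psi$ be two orthogonal schema partitions, and let $\gamma \in \Gamma$ be some schema. We define the
conditional effect of $\Psi$ given $\gamma$, denoted $\mathrm{Effect}[\Psi \mid \gamma]$, as follows:

\[
\mathrm{Effect}[\Psi \mid \gamma] = 2^{-|\langle \Psi \rangle|} \sum_{\psi \in \gamma.\Psi} \Bigl( E[F^{(f)}_\psi] - 2^{-|\langle \Psi \rangle|} \sum_{\xi \in \gamma.\Psi} E[F^{(f)}_\xi] \Bigr)^2
\]
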
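The two definitions of effect can be checked by brute force on a tiny search space. The sketch below is only an illustration under stated assumptions: the helper names (`schema_mean`, `effect`), the 0-based locus indexing, and the toy fitness function `parity` are hypothetical and not part of the formal description. Schemata are represented by their models, i.e. dictionaries from loci to bits, and expected fitness is computed by exhaustive enumeration, which is feasible only for very small string lengths.

```python
from itertools import product
from statistics import mean, pvariance


def schema_mean(f, ell, fixed):
    """Expected fitness of the schema whose defining bits are `fixed`
    (a dict mapping 0-based locus -> bit), by exhaustive enumeration."""
    values = [f(g) for g in product((0, 1), repeat=ell)
              if all(g[i] == b for i, b in fixed.items())]
    return mean(values)


def effect(f, ell, loci, given=None):
    """Effect of the schema partition whose model is `loci`.

    If `given` (a schema model disjoint from `loci`) is supplied, this is
    the conditional effect of the partition given that schema."""
    given = dict(given or {})
    assert not set(given) & set(loci), "partitions must be orthogonal"
    means = []
    for bits in product((0, 1), repeat=len(loci)):
        fixed = dict(given)
        fixed.update(zip(loci, bits))
        means.append(schema_mean(f, ell, fixed))
    # Population variance of the expected fitness values of the schemata.
    return pvariance(means)


def parity(g):
    """Hypothetical toy fitness: 1 when the first two bits agree, else 0."""
    return 1.0 if g[0] == g[1] else 0.0


print(effect(parity, 3, [0]))                # order-1 partition over locus 0
print(effect(parity, 3, [0, 1]))             # finer partition over loci 0 and 1
print(effect(parity, 3, [1], given={0: 1}))  # locus 1, conditioned on bit 0 = 1
```

In this toy case the partition over locus 1 has no unconditional effect, yet it acquires one once locus 0 is fixed, which is the kind of situation the conditional effect is meant to capture.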
A hyperclimbing heuristic works by evaluating the fitness of samples drawn initially
from the uniform distribution over the search space. It finds a coarse schema partition
$\Gamma$ with a non-zero effect, and limits future sampling to some schema $\gamma$ of this partition
whose average sampling fitness is greater than the mean of the average sampling fitness
values of the schemata in $\Gamma$. By limiting future sampling in this way, the heuristic raises
the expected fitness of all future samples. The heuristic limits future sampling to some
schema by fixing the defining bits (Mitchell, 1996) of that schema in all future samples. The
unfixed loci constitute a new (smaller) search space to which the hyperclimbing heuristic
is then recursively applied. Crucially, coarse schema partitions orthogonal to $\Gamma$ that have
undetectable unconditional effects may have detectable effects when conditioned by $\gamma$.
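The loop described above can be written down naively as follows. This is a minimal sketch under stated assumptions, not the mechanism by which a UGA is hypothesized to hyperclimb: it enumerates every partition of a fixed order explicitly, which scales poorly. The function name `hyperclimb`, its parameters (`order`, `samples`, `min_effect`, `seed`), and the toy fitness function `noisy_steps` are all hypothetical.

```python
import random
from itertools import combinations
from statistics import mean, pvariance


def hyperclimb(f, ell, order=2, samples=2000, min_effect=0.1, seed=0):
    """Naive hyperclimbing: repeatedly find the coarse partition with the
    largest sampled effect and fix the defining bits of its fittest schema."""
    rng = random.Random(seed)
    fixed = {}  # 0-based locus -> bit; the defining bits fixed so far
    while True:
        free = [i for i in range(ell) if i not in fixed]
        if len(free) < order:
            return fixed
        # Sample uniformly from the schema described by `fixed`.
        pop = []
        for _ in range(samples):
            g = [fixed[i] if i in fixed else rng.randint(0, 1)
                 for i in range(ell)]
            pop.append((g, f(g)))
        best_effect, best_loci, best_means = 0.0, None, None
        for loci in combinations(free, order):
            groups = {}
            for g, fitness in pop:
                groups.setdefault(tuple(g[i] for i in loci), []).append(fitness)
            if len(groups) < 2 ** order:
                continue  # some schema of this partition was never sampled
            means = {bits: mean(v) for bits, v in groups.items()}
            sampled_effect = pvariance(list(means.values()))
            if sampled_effect > best_effect:
                best_effect, best_loci, best_means = sampled_effect, loci, means
        if best_loci is None or best_effect < min_effect:
            return fixed  # no partition with a detectable effect remains
        fittest = max(best_means, key=best_means.get)
        fixed.update(zip(best_loci, fittest))


def noisy_steps(g, _noise=random.Random(1)):
    """Hypothetical stochastic fitness: loci 0-1 matter first, loci 2-3
    matter only once loci 0-1 are both set; unit Gaussian noise is added."""
    signal = 3.0 * (g[0] & g[1]) + 3.0 * (g[0] & g[1] & g[2] & g[3])
    return signal + _noise.gauss(0.0, 1.0)


print(hyperclimb(noisy_steps, ell=8))
```

The threshold `min_effect` stands in for "detectable": with noisy fitness values, the sampled effect of a partition with no true effect is small but rarely exactly zero.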

Appendix B. Visualizing Staircase Functions 

The stages of a staircase function can be visualized as a progression of nested hyperplanes$^6$,
with hyperplanes of higher order and higher expected fitness nested within hyperplanes of
lower order and lower expected fitness. By choosing an appropriate scheme for mapping a
high-dimensional hypercube onto a two dimensional plot, it becomes possible to visualize
this progression of hyperplanes in two dimensions (Appendix B).

Definition 2 A refractal addressing system is a tuple $\langle m, n, X, Y \rangle$, where $m$ and $n$ are
positive integers, and $X$ and $Y$ are matrices with $m$ rows and $n$ columns such that the
elements in $X$ and $Y$ are distinct positive integers from the set $[2mn]$, such that for any
$k \in [2mn]$, $k$ is in $X$ $\iff$ $k$ is not in $Y$ (i.e. the elements of $[2mn]$ are evenly split between
$X$ and $Y$).



The refractal addressing system $\langle m, n, X, Y \rangle$ determines how the set $\mathfrak{B}_{2mn}$ gets mapped
onto a $2^{mn} \times 2^{mn}$ grid of pixels. For any bitstring $g \in \mathfrak{B}_{2mn}$ the xy-address (a tuple of two
values, each between 1 and $2^{mn}$) of the pixel representing $g$ is given by Algorithm 3.
Example: Let $(h = 4, o = 2, \delta = 3, \ell = 16, L, V)$ be the descriptor of a staircase function
$f$, such that



Let $A = \langle m = 4, n = 2, X, Y \rangle$ be a refractal addressing system such that $X_{1:} = L_{1:}$,
$Y_{1:} = L_{2:}$, $X_{2:} = L_{3:}$, and $Y_{2:} = L_{4:}$. A refractal plot$^7$ of $f$ is shown in Figure 4a.

6. A hyperplane is a geometrical representation of a schema (Goldberg, 1989, p 53). 

7. The term "refractal plot" describes the images that result when dimensional stacking is combined with
pixelation (Langton et al., 2006).
Algorithm 3: The algorithm for determining the $(x, y)$-address of a chromosome
under the refractal addressing system $\langle m, n, X, Y \rangle$. The function Bin-To-Int returns
the integer value of a binary string.
Input: $g$ is a chromosome of length $2mn$

granularity ← $2^{mn} / 2^{n}$
x ← 1
y ← 1
for i ← 1 to m do
    x ← x + granularity * Bin-To-Int($\Xi_{X_{i:}}(g)$)
    y ← y + granularity * Bin-To-Int($\Xi_{Y_{i:}}(g)$)
    granularity ← granularity / $2^{n}$
end
return x, y
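Algorithm 3 translates almost line for line into Python. The sketch below makes a few assumptions that the pseudocode leaves open: `X` and `Y` are nested lists of 1-based loci, the bits selected by a row are read most significant bit first, and the names `bin_to_int` and `refractal_address` are illustrative only.

```python
def bin_to_int(bits):
    """Integer value of a sequence of bits, most significant bit first."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def refractal_address(g, m, n, X, Y):
    """Return the (x, y) pixel address of chromosome g.

    g is a sequence of 2*m*n bits; X and Y are m-by-n nested lists holding
    1-based loci, as in Definition 2. Both coordinates lie in 1..2**(m*n)."""
    assert len(g) == 2 * m * n
    granularity = 2 ** (m * n) // 2 ** n
    x = y = 1
    for i in range(m):
        # Row i of X (resp. Y) selects the bits that refine x (resp. y).
        x += granularity * bin_to_int([g[k - 1] for k in X[i]])
        y += granularity * bin_to_int([g[k - 1] for k in Y[i]])
        granularity //= 2 ** n
    return x, y


# Usage with a small hypothetical addressing system (m = 2, n = 1).
X = [[1], [2]]
Y = [[3], [4]]
print(refractal_address((1, 0, 0, 1), 2, 1, X, Y))
```

Earlier rows of `X` and `Y` control the coarse position of a pixel and later rows control progressively finer offsets, which is why the assignment of loci to rows determines what structure is visible in the plot.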



This image was generated by querying $f$ with every bitstring in $\mathfrak{B}_{16}$, and plotting the
resulting fitness value of each chromosome as a greyscale pixel at the chromosome's refractal
address under the addressing system $A$. The fitness values returned by $f$ have been scaled
to use the full range of possible greyscale shades$^8$. Lighter shades signify greater fitness.
The four stages of $f$ can easily be discerned.

Suppose we generate another refractal plot of $f$ using the same addressing system $A$,
but a different random number generator seed; because $f$ is stochastic, the greyscale value
of any pixel in the resulting plot will then most likely differ from that of its homolog in the
plot shown in Figure 4a. Nevertheless, our ability to discern the stages of $f$ would not be
affected. In the same vein, note that when specifying $A$, we have not specified the values
of the last two rows of $X$ and $Y$; given the definition of $f$ it is easily seen that these values
are immaterial to the discernment of its "staircase structure".

On the other hand, the values of the first two rows of $X$ and $Y$ are highly relevant to the
discernment of this structure. Figure 4b shows a refractal plot of $f$ that was obtained using
a refractal addressing system A' = {m = 4,n = 2,X',Y') such that X'^. = Li-, Y^. = L2:, 
X'^. = L3:, and Y^. = L^-. Nothing remotely resembling a staircase is visible in this plot. 

The lesson here is that the discernment of the fitness staircase inherent within a staircase
function depends critically on how one 'looks' at this function. In determining the 'right'
way to look at $f$ we have used information about the descriptor of $f$, specifically the values
of $h$, $o$, and $L$. This information will not be available to an algorithm which only has query
access to $f$.

Even if one knows the right way to look at a staircase function, the discernment of the
fitness staircase inherent within this function can still be made difficult by a low value of the
increment parameter. Figure 5 lets us visualize the decrease in the salience of the fitness
staircase of $f$ that accompanies a decrease in the increment parameter of this staircase
function. In general, a decrease in the increment results in a decrease in the 'contrast'

8. We used the Matlab function imagesc().



Figure 4: A refractal plot of the staircase function $f$ under the refractal addressing systems
$A$ (a, left) and $A'$ (b, right). Both axes run from 1 to 256.

Figure 5: Refractal plots under $A$ of two staircase functions, which differ from $f$ only in
their increments: 1 (left plot) and 0.3 (right plot) as opposed to 3.



between the stages of that function, and an increase in the amount of computation required
to discern these stages.
Appendix C. Analysis of Staircase Functions



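Let $\ell$ be some positive integer. Given some (possibly stochastic) fitness function $f$ over the
set $\mathfrak{B}_\ell$, and some schema $\gamma \subseteq \mathfrak{B}_\ell$, we define the fitness signal of $\gamma$, denoted $S(\gamma)$, to be
$E[F^{(f)}_\gamma] - E[F^{(f)}_{\mathfrak{B}_\ell}]$. Let $\gamma_1 \subseteq \mathfrak{B}_\ell$ and $\gamma_2 \subseteq \mathfrak{B}_\ell$ be schemata in two orthogonal schema
partitions. We define the conditional fitness signal of $\gamma_1$ given $\gamma_2$, denoted $S(\gamma_1 \mid \gamma_2)$,
to be the difference between the fitness signal of $\gamma_1\gamma_2$ and the fitness signal of $\gamma_2$, i.e.
$S(\gamma_1 \mid \gamma_2) = S(\gamma_1\gamma_2) - S(\gamma_2)$. Given some staircase function $f$ we denote the $i^{th}$ step of $f$
by $\lfloor f \rfloor_i$ and denote the $i^{th}$ stage of $f$ by $\lceil f \rceil_i$.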

Let $f$ be a staircase function with descriptor $(h, o, \delta, \ell, L, V)$. For any integer $i \in [h]$,
the fitness signal of $\lfloor f \rfloor_i$ is one measure of the difficulty of "directly" identifying step $i$ (i.e.,
the difficulty of determining step $i$ without first determining any of the preceding steps
$1, \ldots, i - 1$). Likewise, for any integers $i, j$ in $[h]$ such that $i > j$, the conditional fitness
signal of step $i$ given stage $j$ is one measure of the difficulty of "directly" identifying step
$i$ given stage $j$ (i.e. the difficulty of determining $\lfloor f \rfloor_i$ given $\lceil f \rceil_j$ without first determining
any of the intermediate steps $\lfloor f \rfloor_{j+1}, \ldots, \lfloor f \rfloor_{i-1}$).

For any $i \in [h]$, by Theorem 1 (see below), the unconditional fitness signal of step $i$ is

\[
\frac{\delta}{2^{o(i-1)}}
\]

This value decreases exponentially with $i$ and $o$. It is reasonable, therefore, to suspect that
the direct identification of step $i$ of $f$ quickly becomes infeasible with increases in $i$ and
$o$. Consider, however, that by Corollary 1, for any $i \in \{2, \ldots, h\}$, the conditional fitness
signal of step $i$ given stage $(i - 1)$ is $\delta$, a constant with respect to $i$. Therefore, if some
algorithm can identify the first step of $f$, one should be able to use it to indirectly identify
all remaining steps in time and fitness queries that scale linearly with the height of $f$.
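As a worked illustration, take the values $h = 4$, $o = 2$, $\delta = 3$ from the descriptor used in the example of Appendix B. Applying the statements of Theorem 1 and Corollary 1 to these values gives the following fitness signals, where $\lfloor f \rfloor_i$ denotes step $i$ and $\lceil f \rceil_i$ denotes stage $i$:

\[
S(\lfloor f \rfloor_1) = 3, \qquad
S(\lfloor f \rfloor_2) = \tfrac{3}{4}, \qquad
S(\lfloor f \rfloor_3) = \tfrac{3}{16}, \qquad
S(\lfloor f \rfloor_4) = \tfrac{3}{64},
\]
\[
S(\lfloor f \rfloor_i \mid \lceil f \rceil_{i-1}) = 3 \quad \text{for } i = 2, 3, 4.
\]

The unconditional signal of the last step is thus 64 times weaker than that of the first, whereas each step is as easy to detect as the first once the preceding stage has been fixed.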

Lemma 1 For any staircase function $f$ with descriptor $(h, o, \delta, \ell, L, V)$, and any integer
$i \in [h]$, the fitness signal of stage $i$ is $i\delta$.

Proof: Let $x$ be the expected fitness of $\mathfrak{B}_\ell$ under uniform sampling. We first prove the
following claim:

Claim 1 The fitness signal of stage $i$ is $i\delta - x$.

The proof of the claim follows by induction on $i$. The base case, when $i = h$, is easily seen
to be true from the definition of a staircase function. For any $k \in \{2, \ldots, h\}$, we assume
that the hypothesis holds for $i = k$, and prove that it holds for $i = k - 1$. For any $j \in [h]$,
let $\Gamma_j \in \mathrm{SP}_\ell$ denote the schema partition containing step $j$. The fitness signal of stage $k - 1$
is given by

\[
\frac{k\delta - x}{2^{o}} + \frac{2^{o} - 1}{2^{o}} \left( (k - 1)\delta - \frac{\delta}{2^{o} - 1} - x \right)
\]
where the first term of the right hand side of the above expression follows from the
inductive hypothesis, and the second term follows from the definition of a staircase function.
Manipulation of this expression yields

\[
\frac{k\delta + (2^{o} - 1)\delta(k - 1) - \delta - 2^{o}x}{2^{o}}
\]

which, upon further manipulation, yields $(k - 1)\delta - x$.

This completes the proof of the claim. To prove the lemma, we must prove that $x$ is
zero. By Claim 1, the fitness signal of the first stage is $\delta - x$. By the definition of a staircase
function then,

\[
x = \frac{\delta - x}{2^{o}} + \frac{2^{o} - 1}{2^{o}} \left( -\frac{\delta}{2^{o} - 1} \right)
\]

which reduces to

\[
x = -\frac{x}{2^{o}}
\]

Clearly, $x$ is zero. $\Box$

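Corollary 1 For any $i \in \{2, \ldots, h\}$, the conditional fitness signal of step $i$ given stage $i - 1$
is $\delta$.

Proof: The conditional fitness signal of step $i$ given stage $i - 1$ is given by

\begin{align*}
S(\lfloor f \rfloor_i \mid \lceil f \rceil_{i-1}) &= S(\lceil f \rceil_i) - S(\lceil f \rceil_{i-1}) \\
&= i\delta - (i - 1)\delta \\
&= \delta \qquad \Box
\end{align*}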



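Theorem 1 For any staircase function $f$ with descriptor $(h, o, \delta, \sigma, \ell, L, V)$, and any integer
$i \in [h]$, the fitness signal of step $i$ is $\delta / 2^{o(i-1)}$.

Proof: For any $j \in [h]$, let $\Lambda_j \in \mathrm{SP}_\ell$ denote the partition containing stage $j$, and let
$\Gamma_j \in \mathrm{SP}_\ell$ denote the partition containing step $j$. We first prove the following claim:

Claim 2 For any $i \in [h]$,

\[
\sum_{\xi \in \Lambda_i \setminus \{\lceil f \rceil_i\}} S(\xi) = -i\delta
\]

The proof of the claim follows by induction on $i$. The proof for the base case ($i = 1$) is as
follows:

\[
\sum_{\xi \in \Lambda_1 \setminus \{\lceil f \rceil_1\}} S(\xi) = (2^{o} - 1) \left( -\frac{\delta}{2^{o} - 1} \right) = -\delta
\]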
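For any $k \in [h - 1]$ we assume that the hypothesis holds for $i = k$, and prove that it holds
for $i = k + 1$.

\begin{align*}
\sum_{\xi \in \Lambda_{k+1} \setminus \{\lceil f \rceil_{k+1}\}} S(\xi)
&= \sum_{\psi \in \Gamma_{k+1} \setminus \{\lfloor f \rfloor_{k+1}\}} S(\lceil f \rceil_k \psi)
 + \sum_{\xi \in \Lambda_k \setminus \{\lceil f \rceil_k\}} \sum_{\psi \in \Gamma_{k+1}} S(\xi\psi) \\
&= (2^{o} - 1) \left( S(\lceil f \rceil_k) - \frac{\delta}{2^{o} - 1} \right)
 + 2^{o} \Bigl( \sum_{\xi \in \Lambda_k \setminus \{\lceil f \rceil_k\}} S(\xi) \Bigr)
\end{align*}

where the first and last equalities follow from the definition of a staircase function. Using
Lemma 1 and the inductive hypothesis, the right hand side of this expression can be seen
to equal

\[
(2^{o} - 1) \left( k\delta - \frac{\delta}{2^{o} - 1} \right) - 2^{o}k\delta
\]

which, upon manipulation, yields $-\delta(k + 1)$.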

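For a proof of the theorem, observe that step 1 and stage 1 are the same schema. So,
by Lemma 1, $S(\lfloor f \rfloor_1) = \delta$. Thus, the theorem holds for $i = 1$. For any $i \in \{2, \ldots, h\}$,

\begin{align*}
S(\lfloor f \rfloor_i)
&= \frac{1}{(2^{o})^{i-1}} \Bigl( S(\lceil f \rceil_i) + \sum_{\xi \in \Lambda_{i-1} \setminus \{\lceil f \rceil_{i-1}\}} S(\xi \lfloor f \rfloor_i) \Bigr) \\
&= \frac{1}{(2^{o})^{i-1}} \Bigl( S(\lceil f \rceil_i) + \sum_{\xi \in \Lambda_{i-1} \setminus \{\lceil f \rceil_{i-1}\}} S(\xi) \Bigr)
\end{align*}

where the last equality follows from the definition of a staircase function. Using Lemma 1
and Claim 2, the right hand side of this equality can be seen to equal

\[
\frac{i\delta - (i - 1)\delta}{(2^{o})^{i-1}} = \frac{\delta}{2^{o(i-1)}} \qquad \Box
\]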
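The statements of Lemma 1, Corollary 1, and Theorem 1 can be checked by exhaustive enumeration on a small instance. The sketch below rests on assumptions that should be read as such: the mean fitness rule is the one implied by the proofs above (mean $h\delta$ inside stage $h$, and mean $i\delta - \delta/(2^{o} - 1)$ for strings that lie in stage $i$ but not in stage $i + 1$, with $i = 0$ for strings outside stage 1), the noise term is ignored because only expectations matter, and the choice of $L$ and $V$ (step $i$ occupies $o$ consecutive loci, all set to 1) as well as the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_fitness(g, h, o, delta):
    """Mean fitness of bitstring g under the assumed staircase rule, where
    step i requires loci (i-1)*o .. i*o-1 (0-based) to be all ones."""
    climbed = 0
    for i in range(h):
        if all(g[i * o:(i + 1) * o]):
            climbed += 1
        else:
            break
    if climbed == h:
        return Fraction(h * delta)
    return climbed * delta - Fraction(delta, 2 ** o - 1)


def signal(fixed, ell, h, o, delta):
    """Fitness signal of the schema with defining bits `fixed` (locus -> bit)."""
    everything = list(product((0, 1), repeat=ell))
    members = [g for g in everything
               if all(g[i] == b for i, b in fixed.items())]

    def average(strings):
        return sum(expected_fitness(g, h, o, delta) for g in strings) / len(strings)

    return average(members) - average(everything)


def step(i, o):
    """Model of step i (1-based): its o loci are fixed to 1."""
    return {k: 1 for k in range((i - 1) * o, i * o)}


def stage(i, o):
    """Model of stage i: the concatenation of steps 1..i."""
    return {k: 1 for k in range(i * o)}


h, o, delta, ell = 3, 2, 3, 6
for i in range(1, h + 1):
    stage_signal = signal(stage(i, o), ell, h, o, delta)
    step_signal = signal(step(i, o), ell, h, o, delta)
    print(i, stage_signal, step_signal)
```

Exact rational arithmetic is used so that the enumerated signals can be compared with $i\delta$ and $\delta / 2^{o(i-1)}$ without rounding error.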